Evolving accurate and compact classification rules with gene expression programming

Authors

  • Chi Zhou
  • Weimin Xiao
  • Thomas M. Tirpak
  • Peter C. Nelson
Abstract

Classification is one of the fundamental tasks of data mining. Most rule induction and decision tree algorithms perform local, greedy search to generate classification rules that are often more complex than necessary. Evolutionary algorithms for pattern classification have recently received increased attention because they can perform global searches. In this paper, we propose a new approach for discovering classification rules by using gene expression programming (GEP), a new technique of genetic programming (GP) with linear representation. The antecedent of discovered rules may involve many different combinations of attributes. To guide the search process, we suggest a fitness function considering both the rule consistency gain and completeness. A multiclass classification problem is formulated as multiple two-class problems by using the one-against-all learning method. The covering strategy is applied to learn multiple rules if applicable for each class. Compact rule sets are subsequently evolved using a two-phase pruning method based on the minimum description length (MDL) principle and the integration theory. Our approach is also noise tolerant and able to deal with both numeric and nominal attributes. Experiments with several benchmark data sets have shown up to 20% improvement in validation accuracy, compared with C4.5 algorithms. Furthermore, the proposed GEP approach is more efficient and tends to generate shorter solutions compared with canonical tree-based GP classifiers.
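The one-against-all decomposition and the covering strategy mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the fitness formula, so the score here is plain consistency times completeness rather than the paper's consistency gain, and the hypothetical `learn_threshold_rule` is an exhaustive single-attribute threshold search standing in for the GEP evolution step. All function names are made up for this example.

```python
from typing import Callable, Hashable, List, Sequence

Row = Sequence[float]
Rule = Callable[[Row], bool]


def one_against_all(labels: Sequence[Hashable], target: Hashable) -> List[bool]:
    """Relabel a multiclass problem as 'target class' versus 'everything else'."""
    return [label == target for label in labels]


def completeness(rule: Rule, rows: Sequence[Row], positives: Sequence[bool]) -> float:
    """Share of positive examples that the rule covers."""
    total = sum(positives)
    hit = sum(1 for row, pos in zip(rows, positives) if pos and rule(row))
    return hit / total if total else 0.0


def consistency(rule: Rule, rows: Sequence[Row], positives: Sequence[bool]) -> float:
    """Share of covered examples that are positive."""
    covered = [pos for row, pos in zip(rows, positives) if rule(row)]
    return sum(covered) / len(covered) if covered else 0.0


def learn_threshold_rule(rows: Sequence[Row], positives: Sequence[bool]) -> Rule:
    """Stand-in for the GEP search (assumption): try every 'x[j] <= t' and
    'x[j] > t' rule and keep the one with the best score."""
    best_rule: Rule = lambda row: False
    best_score = -1.0
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            candidates = (
                lambda row, j=j, t=t: row[j] <= t,
                lambda row, j=j, t=t: row[j] > t,
            )
            for rule in candidates:
                # Assumed score: consistency * completeness (not the paper's formula).
                score = consistency(rule, rows, positives) * completeness(rule, rows, positives)
                if score > best_score:
                    best_rule, best_score = rule, score
    return best_rule


def covering(
    rows: Sequence[Row],
    positives: Sequence[bool],
    learn_rule: Callable[[Sequence[Row], Sequence[bool]], Rule],
    max_rules: int = 10,
) -> List[Rule]:
    """Learn rules one at a time, removing the positives each new rule covers."""
    remaining = list(zip(rows, positives))
    rules: List[Rule] = []
    while any(pos for _, pos in remaining) and len(rules) < max_rules:
        rule = learn_rule([r for r, _ in remaining], [p for _, p in remaining])
        kept = [(r, p) for r, p in remaining if not (p and rule(r))]
        if len(kept) == len(remaining):
            break  # the new rule covers no remaining positive; stop
        rules.append(rule)
        remaining = kept
    return rules
```

For a problem with several classes, one would call `one_against_all` once per class and run `covering` on each resulting two-class problem, which yields one rule list per class. The pruning phase described in the abstract is not covered by this sketch.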


Similar resources

Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in the prevention, diagnosis and treatment of diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...


Evolving Multi-label Classification Rules with Gene Expression Programming: A Preliminary Study

The present work describes a preliminary study of a genetic programming algorithm for multi-label classification problems. The algorithm uses Gene Expression Programming and encodes a classification rule in each individual. A niching technique ensures diversity in the population. The final classifier is made up of a set of rules for each label that determines if a pattern belongs or n...


Interpretable gene expression classifier with an accurate and compact fuzzy rule base for microarray data analysis.

An accurate classifier with linguistic interpretability using a small number of relevant genes is beneficial to microarray data analysis and development of inexpensive diagnostic tests. Several frequently used techniques for designing classifiers of microarray data, such as support vector machine, neural networks, k-nearest neighbor, and logistic regression model, suffer from low interpretabili...


Evolving Text Classification Rules with Genetic Programming

We describe a novel method for using Genetic Programming to create compact classification rules using combinations of N-Grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers in terms of precision and recall when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classifica...


Estimating the Saturated Hydraulic Conductivity of Soil Using Gene Expression Programming Method and Comparing It with the Pedotransfer Functions

Saturated hydraulic conductivity is an important physical property of soil that affects water movement in soil. Since measuring saturated hydraulic conductivity by direct methods in the field or in the laboratory is hard, time-consuming and costly, indirect methods are used. The aim of this study is to estimate the saturated hydraulic conductivity from other soil prope...



Journal:
  • IEEE Trans. Evolutionary Computation

Volume 7

Publication date 2003